128 research outputs found

    A new unsupervised feature selection method for text clustering based on genetic algorithms

    Nowadays a vast amount of textual information is collected and stored in various databases around the world, including the Internet as the largest database of all. This rapidly increasing growth of published text means that even the most avid reader cannot hope to keep up with all the reading in a field, and consequently nuggets of insight or new knowledge risk languishing undiscovered in the literature. Text mining offers a solution to this problem by replacing or supplementing the human reader with automatic systems undeterred by the text explosion. It involves analyzing a large collection of documents to discover previously unknown information. Text clustering is one of the most important areas in text mining; it comprises text preprocessing, dimension reduction by selecting some terms (features), and finally clustering using the selected terms. Feature selection appears to be the most important step in the process. Conventional unsupervised feature selection methods define a measure of the discriminating power of terms in order to select suitable terms from the corpus. However, the evaluation of terms in groups has not yet been investigated in reported works. In this paper a new and robust unsupervised feature selection approach is proposed that evaluates terms in groups. In addition, a new Modified Term Variance measure is proposed for evaluating groups of terms. Furthermore, a genetic-based algorithm is designed and implemented for finding the most valuable groups of terms according to the new measure. These terms are then used to generate the final feature vector for the clustering process. To evaluate and justify our approach, the proposed method and a conventional term variance method were implemented and tested on the Reuters-21578 corpus collection. For a more accurate comparison, the methods were tested on three corpora; for each corpus the clustering task was run ten times and the results were averaged. The results of comparing the two methods are very promising and show that our method produces better average accuracy and F1-measure than the conventional term variance method.
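    The abstract above outlines a pipeline of preprocessing, variance-based term selection, and clustering. As a rough, illustrative sketch (assuming scikit-learn and NumPy, and not reproducing the paper's Modified Term Variance measure or its genetic search over groups of terms), the conventional term-variance baseline that the paper compares against could look like this; the function name top_k_by_term_variance is hypothetical.

        # Illustrative sketch of a conventional term-variance feature selection
        # baseline: score each term by the variance of its frequency across
        # documents, keep the top-k terms, and cluster on the reduced vectors.
        # This is NOT the paper's Modified Term Variance or its genetic algorithm.
        import numpy as np
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.cluster import KMeans

        def top_k_by_term_variance(docs, k=500):
            vectorizer = CountVectorizer()
            X = vectorizer.fit_transform(docs).toarray().astype(float)
            variances = X.var(axis=0)                # V(t) = mean_d (f_dt - mean_t)^2
            keep = np.argsort(variances)[::-1][:k]   # indices of the k highest-variance terms
            terms = np.array(vectorizer.get_feature_names_out())[keep]
            return X[:, keep], terms

        docs = ["wheat prices rose sharply", "oil exports fell again", "wheat harvest was large"]
        X_sel, terms = top_k_by_term_variance(docs, k=5)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_sel)
        print(terms, labels)

    In this baseline each term is scored independently; the paper's contribution is to score groups of terms jointly and to search the space of groups with a genetic algorithm, which the sketch above does not attempt.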

    Robust and cost-effective approach for discovering action rules

    The main goal of Knowledge Discovery in Databases is to find interesting and usable patterns that are meaningful in their domain. Actionable Knowledge Discovery came into existence as a direct response to the need for more usable patterns, called actionable patterns. Traditional data mining algorithms are often confined to delivering frequent patterns and fall short of suggesting how to make these patterns actionable. In this scenario the users are expected to act; however, they are not advised what to do with the delivered patterns in order to make them usable. In this paper, we present an automated approach that focuses not only on creating rules but also on making the discovered rules actionable. The few works reported in this field so far suffer from incomprehensibility to the user, overlook cost, and do not provide rule generality. Here we present a method to resolve these issues. In this paper the CEARDM method is proposed to discover cost-effective action rules from data. These rules suggest cost-effective changes for transferring low-profit instances into higher-profit ones. We also propose an idea for improving the CEARDM method.
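    The abstract does not spell out CEARDM's algorithm, but the underlying notion of a cost-effective action rule, a set of attribute changes whose total cost is weighed against the expected profit gain, can be illustrated roughly as below. The data structures and the net_benefit function are hypothetical and are not taken from the paper.

        # Hypothetical illustration of a cost-effective action rule: a set of
        # changes to flexible attributes, each with a cost, intended to move an
        # instance from a low-profit class to a higher-profit one.
        from dataclasses import dataclass
        from typing import List

        @dataclass
        class AtomicAction:
            attribute: str     # flexible attribute to change, e.g. "service_plan"
            from_value: str    # value observed in the low-profit instance
            to_value: str      # suggested new value
            cost: float        # estimated cost of making this change

        @dataclass
        class ActionRule:
            actions: List[AtomicAction]
            expected_profit_gain: float   # profit(target class) - profit(current class)

            def total_cost(self) -> float:
                return sum(a.cost for a in self.actions)

            def net_benefit(self) -> float:
                # A rule is cost-effective only if the gain outweighs the cost.
                return self.expected_profit_gain - self.total_cost()

        rule = ActionRule(
            actions=[AtomicAction("service_plan", "basic", "premium", cost=20.0),
                     AtomicAction("contact_channel", "none", "email", cost=5.0)],
            expected_profit_gain=80.0,
        )
        print(rule.net_benefit())   # 55.0 -> the suggested changes are worth acting on

    A discovery method such as the one described would search for rules of this shape automatically; the sketch only shows how such a rule might be represented and ranked by net benefit.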

    SVM categorizer: a generic categorization tool using support vector machines

    Supervised text categorisation is a significant tool considering the vast amount of structured, unstructured, or semi-structured texts that are available from internal or external enterprise resources. The goal of supervised text categorisation is to assign text documents to finite pre-specified categories in order to extract and automatically organise information coming from these resources. This paper proposes the implementation of a generic application, SVM Categorizer, using the Support Vector Machines algorithm with an innovative statistical adjustment that improves its performance. The algorithm is able to learn from a pre-categorised document corpus and is tested on another, uncategorised one based on a business intelligence case study. This paper discusses the requirements, design and implementation, and describes every aspect of the application to be developed. The final output of the SVM Categorizer is evaluated using commonly accepted metrics so as to measure its performance and contrast it with other classification tools.
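    For readers unfamiliar with the approach, a generic supervised text categorisation pipeline in the spirit described above (TF-IDF features plus a linear SVM, evaluated with standard metrics) can be sketched with scikit-learn as follows. This is a standard baseline, not the SVM Categorizer application or its statistical adjustment, and the tiny corpus is invented purely for illustration.

        # Minimal supervised text categorisation sketch: learn from a
        # pre-categorised corpus, predict categories for unseen documents,
        # and report commonly accepted metrics (precision, recall, F1).
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.svm import LinearSVC
        from sklearn.pipeline import make_pipeline
        from sklearn.metrics import classification_report

        train_docs = ["invoice overdue payment", "quarterly revenue report",
                      "server outage incident", "database backup failed"]
        train_labels = ["finance", "finance", "it", "it"]
        test_docs = ["payment reminder sent", "disk failure on server"]
        test_labels = ["finance", "it"]

        model = make_pipeline(TfidfVectorizer(), LinearSVC())
        model.fit(train_docs, train_labels)
        predicted = model.predict(test_docs)

        print(classification_report(test_labels, predicted))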